4 research outputs found

    Change Management Systems for Seamless Evolution in Data Centers

    Revenue for data centers today is highly dependent on the satisfaction of their enterprise customers. These customers often require various features to migrate their businesses and operations to the cloud. Thus, clouds today introduce new features at a swift pace to onboard new customers and to meet the needs of existing ones. This pace of innovation continues to grow super-linearly; e.g., Amazon deployed 1400 new features in 2017. However, such a rapid pace of evolution adds complexity both for users and for the cloud: clouds struggle to keep up with the deployment speed, and users struggle to learn which features they need and how to use them. The pace of these evolutions has brought us to a tipping point: we can no longer use rules of thumb to deploy new features, and customers need help identifying which features they need. We have built two systems, Janus and Cherrypick, to address these problems. Janus helps data center operators roll out new changes to the data center network. It automatically adapts to the data center topology, routing, traffic, and failure settings, and it reduces the risk of new deployments by letting network operators pick deployment strategies that are less likely to impact users’ performance. Cherrypick finds near-optimal cloud configurations for big data analytics: it lets users search through the machine types the clouds are constantly introducing and find ones with near-optimal performance that meet their budget, and it adapts to new big-data frameworks and applications as well as to new machine types. As the pace of cloud innovation increases, it is critical to have tools that allow operators to deploy new changes as well as tools that enable users to adapt so they achieve good performance at low cost. The tools and algorithms discussed in this thesis help accomplish these goals.
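    The abstract does not detail Cherrypick's search procedure; the published system drives the search with Bayesian optimization. Below is a minimal Python sketch of that style of configuration search, assuming scikit-optimize is available; the machine types, prices, budget, and the run_job stand-in (which would launch the real analytics job) are all hypothetical.

        # Cherrypick-style search for a near-optimal cloud configuration
        # using Bayesian optimization. All numbers are illustrative.
        from skopt import gp_minimize
        from skopt.space import Categorical, Integer

        SPACE = [
            Categorical(["m4.large", "m4.xlarge", "r4.xlarge", "c4.xlarge"]),
            Integer(4, 32),  # cluster size
        ]
        PRICE_PER_HOUR = {"m4.large": 0.10, "m4.xlarge": 0.20,
                          "r4.xlarge": 0.27, "c4.xlarge": 0.20}
        BUDGET = 5.0  # dollars per run (illustrative)

        def run_job(machine_type, cluster_size):
            # Hypothetical stand-in: a real tuner would launch the job on
            # this configuration and measure its running time (in hours).
            base = {"m4.large": 4.0, "m4.xlarge": 2.5,
                    "r4.xlarge": 2.0, "c4.xlarge": 2.2}[machine_type]
            return base / (cluster_size ** 0.5)

        def objective(params):
            machine_type, cluster_size = params
            runtime = run_job(machine_type, cluster_size)
            cost = runtime * cluster_size * PRICE_PER_HOUR[machine_type]
            # Penalize over-budget configurations so the optimizer steers
            # toward cheap-and-fast regions of the space.
            return runtime + (100.0 if cost > BUDGET else 0.0)

        best = gp_minimize(objective, SPACE, n_calls=15, random_state=0)
        print("best configuration:", best.x, "objective:", best.fun)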

    Black or White? How to Develop an AutoTuner for Memory-based Analytics [Extended Version]

    There is a lot of interest today in building autonomous (or self-driving) data processing systems. An emerging school of thought is to leverage AI-driven "black-box" algorithms for this purpose. In this paper, we present a contrarian view. We study the problem of autotuning the memory allocation for applications running on modern distributed data processing systems. For this problem, we show that an empirically driven "white-box" algorithm we have developed, called RelM, provides close-to-optimal tuning at a fraction of the overheads of state-of-the-art AI-driven "black-box" algorithms, namely Bayesian Optimization (BO) and Deep Distributed Policy Gradient (DDPG). The main reason for RelM's superior performance is that memory management in modern memory-based data analytics systems is an interplay of algorithms at multiple levels: (i) at the resource-management level, across the containers allocated by resource managers like Kubernetes and YARN; (ii) at the container level, among the OS, pods, and processes such as the Java Virtual Machine (JVM); (iii) at the application level, for caching, aggregation, data shuffles, and application data structures; and (iv) at the JVM level, across pools such as the Young and Old Generations. RelM understands these interactions and uses them to build an analytical solution that autotunes the memory management knobs. In another contribution, called GBO, we use RelM's analytical models to speed up Bayesian Optimization. Through an evaluation based on Apache Spark, we show that RelM's recommendations are significantly better than what commonly used Spark deployments provide and are close to the ones obtained by brute-force exploration, while GBO provides optimality guarantees at a cost overhead that is higher than RelM's but still significantly lower than that of the state-of-the-art AI-driven policies.
    Comment: Main version in ACM SIGMOD 202
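    RelM's analytical models are not reproduced in the abstract. As an illustration of the white-box style it argues for, the Python sketch below sizes the main Spark executor memory pools from a fixed container budget using Spark's documented memory model (a 300 MB reserved heap, the unified storage/execution region governed by spark.memory.fraction, and an off-heap overhead of max(384 MB, 10% of heap)); the profiled cache and shuffle requirements are hypothetical inputs that a RelM-style tuner would estimate for the application.

        # Illustrative white-box sizing of Spark executor memory pools from
        # a fixed container budget. The split rules follow Spark's documented
        # memory model; the profiled workload requirements are hypothetical
        # inputs a RelM-style tuner would estimate from the application.
        RESERVED_MB = 300          # Spark's reserved heap
        OVERHEAD_FLOOR_MB = 384    # minimum off-heap overhead per executor

        def size_executor(container_mb, cache_need_mb, shuffle_need_mb):
            # Off-heap overhead is max(384 MB, 10% of heap); solve for the
            # largest heap that fits: heap + overhead <= container.
            heap_mb = min(container_mb - OVERHEAD_FLOOR_MB,
                          int(container_mb / 1.1))
            usable_mb = heap_mb - RESERVED_MB
            # Unified region shared by caching (storage) and shuffles
            # (execution); size it to the profiled needs, capped at 80% so
            # user data structures keep some headroom.
            unified_mb = min(cache_need_mb + shuffle_need_mb,
                             int(0.8 * usable_mb))
            return {
                "spark.executor.memory": f"{heap_mb}m",
                "spark.memory.fraction": round(unified_mb / usable_mb, 2),
                "spark.memory.storageFraction":
                    round(cache_need_mb / max(unified_mb, 1), 2),
            }

        print(size_executor(container_mb=8192,
                            cache_need_mb=2048, shuffle_need_mb=1536))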

    Infinite CacheFlow in software-defined networks

    Software-Defined Networking (SDN) enables fine-grained policies for firewalls, load balancers, routers, traffic monitoring, and other functionality. While Ternary Content Addressable Memory (TCAM) enables OpenFlow switches to process packets at high speed based on multiple header fields, today’s commodity switches support just thousands to tens of thousands of rules. To realize the potential of SDN on this hardware, we need efficient ways to support the abstraction of a switch with arbitrarily large rule tables. To do so, we define a hardware-software hybrid switch design that relies on rule caching to provide large rule tables at low cost. Unlike traditional caching solutions, we neither cache individual rules (to respect rule dependencies) nor compress rules (to preserve the per-rule traffic counts). Instead we “splice” long dependency chains to cache smaller groups of rules while preserving the semantics of the network policy. Our design satisfies four core criteria: (1) elasticity (combining the best of hardware and software switches), (2) transparency (faithfully supporting native OpenFlow semantics, including traffic counters), (3) fine-grained rule caching (placing popular rules in the TCAM, despite dependencies on less-popular rules), and (4) adaptability (enabling incremental changes to the rule caching as the policy changes)
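    The rule dependencies that the caching must respect can be made concrete. Below is a simplified Python sketch, not the paper's exact algorithm: it treats any overlap between a rule's ternary match pattern and that of a higher-priority rule as a dependency (the paper's definition is more refined, since it also accounts for which packets actually reach each rule), and it uses toy 4-bit patterns.

        # Sketch of the rule-dependency analysis that dependency-aware rule
        # caching relies on: a lower-priority rule depends on any
        # higher-priority rule whose ternary pattern overlaps its own, since
        # caching the former alone would steal the latter's packets.
        def overlaps(p, q):
            # Two ternary patterns ('0', '1', or '*' per bit) overlap iff
            # no bit position pins them to different concrete values.
            return all(a == b or a == "*" or b == "*" for a, b in zip(p, q))

        def dependency_graph(rules):
            # rules: list of (priority, pattern), highest priority first.
            # Returns {pattern: higher-priority patterns it depends on}.
            deps = {}
            for i, (_, low) in enumerate(rules):
                deps[low] = {high for _, high in rules[:i]
                             if overlaps(high, low)}
            return deps

        rules = [(3, "1*1*"), (2, "11**"), (1, "****")]  # catch-all last
        print(dependency_graph(rules))
        # The catch-all depends on both higher-priority rules, so caching it
        # alone would hide them; splicing inserts narrow cover rules instead
        # of caching the whole chain.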

    Rule-Caching Algorithms for Software-Defined Networks

    Software-Defined Networking (SDN) allows control applications to install fine-grained forwarding policies in the underlying switches, using a standard API like OpenFlow. High-speed Ternary Content Addressable Memory (TCAM) allows hardware switches to store these rules and perform a parallel lookup to quickly identify the highest-priority match for each packet. While TCAM enables fast lookups with flexible wildcard rule patterns, the cost and power requirements limit the number of rules the switches can support. To make matters worse, these hardware switches cannot sustain a high rate of updates to the rule table. In this paper, we show how to give applications the illusion of high-speed forwarding, large rule tables, and fast updates by combining the best of hardware and software processing. Our CacheFlow system “caches” the most popular rules in the small TCAM, while relying on software to handle the small amount of “cache miss” traffic. However, we cannot blindly apply existing cache-replacement algorithms, because of dependencies between rules with overlapping patterns. Rather than cache large chains of dependent rules, we “splice” long dependency chains to cache smaller groups of rules while preserving the semantics of the policy. Experiments with our CacheFlow prototype, on both real and synthetic workloads and policies, demonstrate that rule splicing makes effective use of limited TCAM space, while adapting quickly to changes in the policy and the traffic demands.
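    To illustrate the splicing idea in a hedged way, the Python sketch below greedily fills a toy TCAM with the most popular rules and, for each chosen rule, spends extra slots on cover entries that deflect its dependencies' overlapping traffic to the software switch, instead of pulling the whole dependency chain into hardware. The popularity counts, the dependency graph (built as in the previous sketch), and the reuse of each dependency's own pattern as its cover entry are all simplifications of the paper's algorithm.

        # Sketch of the "splice" idea: cache a popular rule plus narrow
        # cover entries that punt its dependers' overlapping traffic to the
        # software switch, rather than caching the full dependency chain.
        def splice_select(counts, deps, tcam_size):
            # counts: {pattern: packet_count}; deps: dependency graph.
            # Greedily fill the TCAM with popular rules; each uncovered
            # dependency of a chosen rule costs one slot for a cover entry.
            cached, covers, used = set(), set(), 0
            for pat, _ in sorted(counts.items(), key=lambda kv: -kv[1]):
                needed = {d for d in deps[pat]
                          if d not in cached and d not in covers}
                cost = 1 + len(needed)
                if used + cost <= tcam_size:
                    cached.add(pat)
                    covers |= needed  # these entries punt to software
                    used += cost
            return cached, covers

        counts = {"1*1*": 10, "11**": 200, "****": 5000}
        deps = {"1*1*": set(), "11**": {"1*1*"}, "****": {"1*1*", "11**"}}
        print(splice_select(counts, deps, tcam_size=3))
        # The popular catch-all is cached alongside two cover entries,
        # instead of dragging both real higher-priority rules into TCAM.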